In [1]:
# necessary package imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from umap import UMAP

# define notebook options
pd.options.display.max_columns = 100

# set style - I like 538 for the aesthetics
plt.style.use('fivethirtyeight')

%matplotlib inline

Lab One: Exploring Table Data¶

Team: Mike Wisniewski¶

Business Understanding¶

With business and travel revving back up after a hiatus due to COVID-19, hotel companies now need to predict hotel capacity as numbers begin to stabilize.

Two different hotels (a city-based hotel and a resort-based hotel) have opened up their survey results from a series of surveys conducted between 2015 and 2017. This dataset contains 119,390 observations with 35 features and 1 target variable, which I chose myself (is_canceled is the prediction).

This notebook is intended to provide visualizations of features and feature relationships with other features. These visualizations are intended to help hotel management understand the most important causes of booking cancelations.

This model needs to justify itself by achieving a higher classification accuracy for cancelations than current hotel capacity models. I believe the value-add from a high-level view of this data is to give hotel management a sense of what actionable next steps they can take to reduce cancelations and produce a more solid forecast of room demand.


Dataset provided by Kaggle: https://www.kaggle.com/datasets/mojtaba142/hotel-booking

Author's Note: This dataset was carefully chosen and verified outside of Kaggle, as Kaggle has a problem with fake or shady-sourced datasets that are not verifiable elsewhere. I was able to verify the survey from the following: https://www.sciencedirect.com/science/article/pii/S2352340918315191. Although this does not 100% prove that this is real data, this is the best I could do from Kaggle without picking one of the more obvious datasets (Iris, Titanic, heart/stroke/diabetes detection, mushrooms, etc.)

The overall question: who is most likely to cancel their reservations? And if that's determinable, what is the root cause?

Data Understanding¶

Attribute Definitions¶

In [2]:
# load in the dataset
df = pd.read_csv("hotel_booking.csv")
df.head()
Out[2]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies meal country market_segment distribution_channel is_repeated_guest previous_cancellations previous_bookings_not_canceled reserved_room_type assigned_room_type booking_changes deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date name email phone-number credit_card
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 0.0 0 BB PRT Direct Direct 0 0 0 C C 3 No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01 Ernest Barnes Ernest.Barnes31@outlook.com 669-792-1661 ************4322
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 0.0 0 BB PRT Direct Direct 0 0 0 C C 4 No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01 Andrea Baker Andrea_Baker94@aol.com 858-637-6955 ************9157
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 0.0 0 BB GBR Direct Direct 0 0 0 A C 0 No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02 Rebecca Parker Rebecca_Parker@comcast.net 652-885-2745 ************3734
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 0.0 0 BB GBR Corporate Corporate 0 0 0 A A 0 No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02 Laura Murray Laura_M@gmail.com 364-656-8427 ************5677
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 0.0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 2015-07-03 Linda Hines LHines@verizon.com 713-226-5883 ************5498
In [3]:
# create a new feature called children_and_babies
df["children_and_babies"] = df["children"] + df["babies"]

Key Attributes and Assumptions¶

Per the data documentation, the last 4 columns are synthetically generated and thus will be discarded. Additionally, I have chosen only 10 features (plus 1 prediction target) for this exercise, but I truly believe each feature in this dataset is relevant to answering the overall question (except the aforementioned synthetic data). I chose these features based on what I perceive as most likely to explain variance in cancelations. I will be using all relevant features in a later exercise when I use UMAP on the dataset to reduce the dimensions.

The features I've chosen:

- lead_time
- stays_in_weekend_nights
- stays_in_week_nights
- children_and_babies (I will be combining children and babies into 1 for now)
- market_segment
- is_repeated_guest
- previous_cancellations
- booking_changes
- assigned_room_type
- deposit_type
- is_canceled (prediction)
In [4]:
# extract metadata for the entire dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 37 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
 32  name                            119390 non-null  object 
 33  email                           119390 non-null  object 
 34  phone-number                    119390 non-null  object 
 35  credit_card                     119390 non-null  object 
 36  children_and_babies             119386 non-null  float64
dtypes: float64(5), int64(16), object(16)
memory usage: 33.7+ MB

Data Types¶

Observing my selected 10 features, all datatypes appear correct except children_and_babies. children_and_babies is a float but should be an int. This feature is a float because, looking at the info() output above, children was imported as a float (it contains nulls). I'm unaware of partial babies, so it is safe to assume that this should be an integer.

Nulls¶

Notice that three features (children_and_babies, agent, company) contain null values. It is a safe assumption, based on the documentation, that null values here indicate the lack of an agent used or a company booked through (which makes sense because this data contains individuals booking for themselves). For children_and_babies, it is safe to assume a null is a 0 value, as there are only 4 records with null values and this should not skew results.

Therefore, when I use the entire dataset, I will simply replace NULLs with "None" as a string (for agent, company). For the selected feature set I will replace NULL with 0 (for children_and_babies).
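As a quick sketch of that full-dataset replacement (using a toy frame here so the idea stands alone; the hypothetical values mimic the real agent and company columns):

```python
import numpy as np
import pandas as pd

# toy stand-in for the agent/company columns of the full dataset (values hypothetical)
df_full = pd.DataFrame({
    "agent": [240.0, np.nan, 304.0],
    "company": [np.nan, np.nan, 40.0],
})

# a missing agent/company means no agent was used / no company was booked through,
# so a string "None" marker is more faithful than a numeric fill
for col in ["agent", "company"]:
    df_full[col] = df_full[col].fillna("None")
```

With the real dataframe, the same loop applies unchanged.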

Hypothetically, if there were NULL values in features that should not have NULL values, I would determine the fill method based on the strength of correlation to the prediction (using a correlation matrix). A low strength of correlation would warrant a mean or median method because those features may not matter as much. A stronger correlation may require using a regression to fill in nulls, because those features would most likely explain the variance.
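A minimal sketch of that decision rule, assuming a hypothetical cutoff of 0.3 on the absolute correlation with the target (the toy data below is illustrative only):

```python
import numpy as np
import pandas as pd

# toy data: one feature with a missing value plus the binary target
toy = pd.DataFrame({
    "feature": [1.0, 2.0, np.nan, 4.0, 5.0, 2.0],
    "is_canceled": [0, 1, 0, 1, 0, 1],
})

# correlation is computed pairwise on the rows where the feature is present
corr = toy["feature"].corr(toy["is_canceled"])

if abs(corr) < 0.3:
    # weak relationship to the prediction: a simple median fill suffices
    toy["feature"] = toy["feature"].fillna(toy["feature"].median())
else:
    # strong relationship: fit a regression on the other features to impute instead
    ...
```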

In [5]:
# replace nulls
df["children_and_babies"] = df["children_and_babies"].replace(np.nan, 0)

# cast feature as int
df["children_and_babies"] = df["children_and_babies"].astype(int)
In [6]:
# filter feature set down to chosen features
feature_set = ["lead_time", 
                "stays_in_weekend_nights",
                "stays_in_week_nights",
                "children_and_babies",
                "market_segment",
                "is_repeated_guest",
                "previous_cancellations",
                "booking_changes",
                "assigned_room_type",
                "deposit_type",
                "is_canceled"]

df_fil_features = df[feature_set]

# extract meta data of dataframe for the feature set
df_fil_features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 11 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   lead_time                119390 non-null  int64 
 1   stays_in_weekend_nights  119390 non-null  int64 
 2   stays_in_week_nights     119390 non-null  int64 
 3   children_and_babies      119390 non-null  int32 
 4   market_segment           119390 non-null  object
 5   is_repeated_guest        119390 non-null  int64 
 6   previous_cancellations   119390 non-null  int64 
 7   booking_changes          119390 non-null  int64 
 8   assigned_room_type       119390 non-null  object
 9   deposit_type             119390 non-null  object
 10  is_canceled              119390 non-null  int64 
dtypes: int32(1), int64(7), object(3)
memory usage: 9.6+ MB

Data Summarization and Dictionary¶

In [7]:
# summarize dataset for just the feature set
df_summary = df_fil_features.describe()
df_summary
Out[7]:
lead_time stays_in_weekend_nights stays_in_week_nights children_and_babies is_repeated_guest previous_cancellations booking_changes is_canceled
count 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000
mean 104.011416 0.927599 2.500302 0.111835 0.031912 0.087118 0.221124 0.370416
std 106.863097 0.998613 1.908286 0.412561 0.175767 0.844336 0.652306 0.482918
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 18.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 69.000000 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 160.000000 2.000000 3.000000 0.000000 0.000000 0.000000 0.000000 1.000000
max 737.000000 19.000000 50.000000 10.000000 1.000000 26.000000 21.000000 1.000000

In the next block of code, I took inspiration from the graded_example. In my line of work (also AI/ML) I have never seen a description table like this built in Python (it's usually done in Excel and called a data dictionary), and I think it's a good idea to have.

In [8]:
data_dictionary = pd.DataFrame()

data_dictionary["Features"] = df_fil_features.columns
data_dictionary["Definition"] = ["Number of days booked before reservation date",
                                 'Number of nights stayed during the weekend',
                                 "Number of nights stayed during the week",
                                 "Number of children and babies",
                                 "Hotel description of customer type",
                                 "Flag indicator for if customer has stayed before",
                                 "Number of previous cancelations",
                                 "Number of changes to a booking",
                                 "Type of room",
                                 "Type of payment/deposit",
                                 "Whether customer canceled or not"]


data_dictionary["Measurement Type"] = ["Ratio", "Ratio", "Ratio", "Ratio", "Nominal", "Nominal", "Ratio", "Ratio", "Nominal", "Nominal", "Nominal"]
data_dictionary["Discrete/Continuous"] = ["Discrete"] * 11
data_dictionary["Range"] = ["0 - 737", "0 - 19", "0 - 50", "0 - 10"] + [repr(set(df_fil_features["market_segment"]))] + \
                           ["0: No, 1: Yes", "0 - 26", "0 - 21"] + [repr(set(df_fil_features["assigned_room_type"]))] + \
                           [repr(set(df_fil_features["deposit_type"]))] + ["0: No, 1: Yes"]
data_dictionary["Number of Unique Values"] = [
    len(df_fil_features["lead_time"].unique()),
    len(df_fil_features["stays_in_weekend_nights"].unique()),
    len(df_fil_features["stays_in_week_nights"].unique()),
    len(df_fil_features["children_and_babies"].unique()),
    len(df_fil_features["market_segment"].unique()),
    len(df_fil_features["is_repeated_guest"].unique()),
    len(df_fil_features["previous_cancellations"].unique()),
    len(df_fil_features["booking_changes"].unique()),
    len(df_fil_features["assigned_room_type"].unique()),
    len(df_fil_features["deposit_type"].unique()),
    len(df_fil_features["is_canceled"].unique()),
]

data_dictionary
Out[8]:
Features Definition Measurement Type Discrete/Continuous Range Number of Unique Values
0 lead_time Number of days booked before reservation date Ratio Discrete 0 - 737 479
1 stays_in_weekend_nights Number of nights stayed during the weekend Ratio Discrete 0 - 19 17
2 stays_in_week_nights Number of nights stayed during the week Ratio Discrete 0 - 50 35
3 children_and_babies Number of children and babies Ratio Discrete 0 - 10 6
4 market_segment Hotel description of customer type Nominal Discrete {'Corporate', 'Direct', 'Complementary', 'Grou... 8
5 is_repeated_guest Flag indicator for if customer has stayed before Nominal Discrete 0: No, 1: Yes 2
6 previous_cancellations Number of previous cancelations Ratio Discrete 0 - 26 15
7 booking_changes Number of changes to a booking Ratio Discrete 0 - 21 21
8 assigned_room_type Type of room Nominal Discrete {'H', 'K', 'F', 'E', 'G', 'D', 'B', 'P', 'C', ... 12
9 deposit_type Type of payment/deposit Nominal Discrete {'No Deposit', 'Non Refund', 'Refundable'} 3
10 is_canceled Whether customer canceled or not Nominal Discrete 0: No, 1: Yes 2

Per the data source documentation, each record is truly unique in the original dataset. I believe there may have been data preprocessing prior to obtaining this dataset. For completeness' sake, I will show that each record is unique. However, notice that when performing this same exercise on the feature set, there appear to be duplicates. I will not drop these records because I know they are unique.

In [9]:
print("Original length of df: ", len(df))
print("Length of df after dropping duplicates: ", len(df.drop_duplicates()))
print("Original length of df_fil_features: ", len(df_fil_features))
print("Length of df_fil_features after dropping duplicates: ", len(df_fil_features.drop_duplicates()))
Original length of df:  119390
Length of df after dropping duplicates:  119390
Original length of df_fil_features:  119390
Length of df_fil_features after dropping duplicates:  48904

Data Visualizations¶

In this section, we explore some data visualizations that try to tackle the overall question: who is most likely to cancel their reservations? And if that's determinable, what is the root cause? Although we start with this question, we may find ourselves asking more questions along the way. If that is the case, I will address the question and provide any visualizations that are appropriate.

I hypothesize that we can answer the first part of this question through visualizations, but the second part might need some more deep-rooted analysis. The first couple of visualizations will show different features with respect to the cancelation status (e.g. room type counts vs. cancelation, to use a simple example).

Categorical¶

To start simple, let's see what the data looks like grouped by market_segment.

In [10]:
# group dataset by market_segment
df_ms = df_fil_features.groupby(["market_segment"]).agg({"is_canceled":"sum", "market_segment":"count"})
df_ms = df_ms.rename(columns={"market_segment":"record_count"})
In [11]:
# calculate the % canceled
df_ms["cancelation_rate"] = df_ms["is_canceled"] / df_ms["record_count"]
In [12]:
# define a function to plot a standard horizontal bar chart; reused throughout the notebook
# (the first argument is kept for call-site consistency; the plotted column is sort_by)
def barh_plot(x, sort_by, df, x_label, y_label):

    # sort the values ascending so the largest bar renders on top, then plot
    axis = df[sort_by].sort_values().plot.barh(color="c", ec="k", lw=0.5, alpha=0.65)
    axis.grid(color="k", alpha=0.65)
    axis.grid(visible=False, which="major", axis="y")

    # title and axis labels
    plt.title(f"{x_label} by {y_label}")
    plt.ylabel(y_label)
    plt.xlabel(x_label)
In [13]:
# plot cancelation ratio by market segment
barh_plot("cancelation_rate", "cancelation_rate", df_ms, "Ratio Canceled", "Market Segment")

My first immediate thought is how many records reside in Undefined. If there is a significant number of records in Undefined, we may need to explore why it sits at a 100% cancelation rate and whether this is an error or not. First I will plot a simple count via bar chart, then I will plot a Pareto chart to illustrate our 80/20.

In [14]:
# plot number of records by market segment
barh_plot("cancelation_rate", "record_count", df_ms, "Number of Records", "Market Segment")

As suspected, "Undefined" has a negligible number of records, so we do not need to investigate further. For completeness' sake, below is the Pareto chart on top of the % canceled.

In [15]:
def pareto_bar(x_bar, y_bar, x_par, y_par, x_label, y_label, y_label_par):
    # because we have two plots, we must generate a subplot to render both graphs on the same figure
    f, axis = plt.subplots()

    # first plot is a bar graph
    axis.bar(x_bar, y_bar, color="c")

    # second plot is the cumulative (Pareto) line on a twin y-axis
    pareto_axis = axis.twinx()
    pareto_axis.plot(x_par, y_par, color="r", marker="o", ms=7)

    # config plot settings to look nice
    axis.grid(color="k", alpha=0.65)
    pareto_axis.grid(color="k", alpha=0.65)
    axis.grid(visible=False, which="major", axis="x")
    pareto_axis.grid(visible=False, which="major")
    axis.set_xlabel(x_label)
    axis.set_ylabel(y_label)
    pareto_axis.set_ylabel(y_label_par)

    # fix the tick positions before relabeling (avoids the FixedFormatter warning)
    axis.set_xticks(range(len(x_bar)))
    axis.set_xticklabels(labels=x_bar, rotation=45, ha="right")

    # plot
    plt.show()
In [16]:
# sort the dataframe
df_ms = df_ms.sort_values("cancelation_rate", ascending=False)

# calculate the cumulative % of records per market segment
df_ms["c_percentage"] = df_ms["record_count"].cumsum() / df_ms["record_count"].sum() * 100

# plot the Pareto chart
pareto_bar(df_ms.index, df_ms["cancelation_rate"], df_ms.index, df_ms["c_percentage"], 
           "Market Segment", "Ratio Canceled", "% of Record Count")

The target audiences for our purposes appear to be "Groups", "Online TA", and "Offline TA/TO". Intuitively, it makes sense that these audiences would be most likely to cancel because, from my experiences as a consultant and frequent flyer, these appear to be families, consultants, and lawyers. I deduce this intuitively by looking at the other categories: Aviation are airline pilots, and they are typically confined to a set number of hotels when they perform their flights. Corporate bookings are most likely corporate events, and the logistics of setting up corporate events are too big to be canceling. Complementary stays are reimbursed or free, and I would think there would be enough time buffer to allow the customer adequate planning, hence lowering the cancelation %. But for the 3 main groups that comprise ~80% of cancelations, can we deduce the why? Let's hold on to this question and continue some exploratory analysis to see if we can generate any more valuable insights. The next few visualizations will be the same as we've seen so far, just different cuts of the data.
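To put a number on the "~80% of cancelations" claim, the grouped frame from cell [10] could be queried like this (the counts below are made up for the sketch; the segment names are assumed to match the dataset's spelling):

```python
import pandas as pd

# toy stand-in for df_ms with hypothetical cancelation counts; with the real
# frame, replace this with the df_ms built in cell [10]
df_ms_demo = pd.DataFrame(
    {"is_canceled": [12000, 20000, 8000, 1000, 500, 300, 100, 2]},
    index=["Groups", "Online TA", "Offline TA/TO", "Direct",
           "Corporate", "Complementary", "Aviation", "Undefined"],
)

# share of all cancelations attributable to the three target segments
top3 = ["Groups", "Online TA", "Offline TA/TO"]
share = df_ms_demo.loc[top3, "is_canceled"].sum() / df_ms_demo["is_canceled"].sum()
```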

Is there anything useful to gain from looking at room type by cancelations? I will skip the commentary until all visualizations appear, just for the sake of brevity.

In [17]:
# group by room type
df_rt = df_fil_features.groupby(["assigned_room_type"]).agg({"is_canceled":"sum", "assigned_room_type":"count"})
df_rt = df_rt.rename(columns={"assigned_room_type":"record_count"})
In [18]:
# calculate the % canceled
df_rt["cancelation_rate"] = df_rt["is_canceled"] / df_rt["record_count"]

# plot cancelation ratio by room type
barh_plot("cancelation_rate", "cancelation_rate", df_rt, "Ratio Canceled", "Room Type")
In [19]:
# plot records by room type
df_rt = df_rt.sort_values("cancelation_rate", ascending=False)
barh_plot("cancelation_rate", "record_count", df_rt, "Number of Records", "Room Type")

# plot pareto of both
df_rt["c_percentage"] = df_rt["record_count"].cumsum()/df_rt["record_count"].sum()*100
pareto_bar(df_rt.index, df_rt["cancelation_rate"], df_rt.index, df_rt["c_percentage"], 
           "Room Type", "Ratio Canceled", "% of Record Count")

This is a similar story to market segment: there appear to be room types that have a high cancelation rate but are minuscule in terms of record count. Room types A, H, G, E, and D are our 80/20 for this exercise. But I'm not convinced this is a telling feature, at least not by itself. I hypothesize that these room types are simply the most popular across the vast majority of hotels and therefore would be most likely to have cancelations due to sheer frequency.

Let's continue on with our last categorical feature: deposit_type

In [20]:
# group by deposit type
df_dt = df_fil_features.groupby(["deposit_type"]).agg({"is_canceled":"sum", "deposit_type":"count"})
df_dt = df_dt.rename(columns={"deposit_type":"record_count"})
In [21]:
# calculate the % canceled
df_dt["cancelation_rate"] = df_dt["is_canceled"] / df_dt["record_count"]

# plot cancelation ratio by deposit type
barh_plot("cancelation_rate", "cancelation_rate", df_dt, "Ratio Canceled", "Deposit Type")
In [22]:
# plot records by deposit type
df_dt = df_dt.sort_values("cancelation_rate", ascending=False)
barh_plot("cancelation_rate", "record_count", df_dt, "Number of Records", "Deposit Type")

# plot pareto of both
df_dt["c_percentage"] = df_dt["record_count"].cumsum()/df_dt["record_count"].sum()*100
pareto_bar(df_dt.index, df_dt["cancelation_rate"], df_dt.index, df_dt["c_percentage"], 
           "Deposit Type", "Ratio Canceled", "% of Record Count")

I don't think this particular cut of the data is very insightful. Perhaps it might be useful in combination with other features.

Continuous¶

A good method for visualizing and analyzing continuous variables is violin plots and histograms. A violin plot is a good way to see the distribution of one feature with respect to the distribution of another feature. A histogram is a tried-and-true method of showing the distribution of a single feature. We will utilize both to further refine our dataset and answer our questions.

There are a few continuous metrics that pique my interest. The first we will start with is lead_time and see if that has any influence on cancelations.

In [23]:
# violin plot function
def violin_plot(x, y, z, df):

    # initialize our figure and axis
    f, axis = plt.subplots(figsize=(9, 9))

    # Seaborn allows us to create and plot our graph in one function
    sns.violinplot(x=x, y=y, hue=z, data=df, split=True, inner="quart", ax=axis)

    # rotate the category labels after plotting, taking them from the axis itself
    # so they always match the plotted data
    axis.set_xticklabels(axis.get_xticklabels(), rotation=45, ha="right")
In [24]:
# Market segment view of a lead_time vs is_canceled distribution
violin_plot("market_segment", "lead_time", "is_canceled", df_fil_features)

This is very interesting, and it brings up the question of outliers. Notice that earlier in our analysis we did not modify the dataset for outliers, and I may have even hinted that this dataset doesn't contain "outliers" because it's processed. Although it's true that this dataset has no erroneous data, that does not mean it contains no outliers. Outliers are partly objective but partly subjective. Let's refer to our feature and interpret what it means: lead time is the number of days a booking is made prior to the stay. Other than corporate events, I think anything above 1 year is perhaps not representative of the general population, and thus I will filter these records out. These records merely skew our visualizations and probably don't skew our prediction results (but we will explore this once we build actual models).

In [25]:
# filter out records that have bookings >= 365 days
df_fil_features = df_fil_features[df_fil_features["lead_time"] < 365].reset_index(drop=True)

# replot
violin_plot("market_segment", "lead_time", "is_canceled", df_fil_features)

This is a much more usable visualization for this analysis. Let's start with our non-factors first: Undefined and Aviation. Undefined we can ignore - it has 2 records. Aviation is interesting but not surprising. Airlines probably book a standard number of business days ahead of scheduled flights. Fun fact: as a consultant, I worked on air cargo capacity, and the industry produces a yearly "sim tape", or simulation of all projected flights for a year. This contains both cargo and passenger planes. The sim tape is merely a projection and is typically 90-95% accurate of what will occur for the year. But the "actuals" don't occur until the week of, so flights aren't locked into a schedule until the beginning of a week (Mondays in the airline industry). This is a long-winded way of saying it's not surprising that Aviation hovers around the 1-7 day distribution. It makes total sense to me. But this is a combination of speculation and experience, not an interpretation of the data at hand.

I'm very surprised about Complementary. I would think that a free hotel stay or a free credit could be used within 1-2 years and would give a customer a longer runway for booking. But something is odd here. Perhaps hotel credits pre-COVID had to be used within 3 months (which appears to be the case). This would create more tension between a customer's availability to use a credit and cancelations. I'm curious whether post-COVID data would show the same pattern or whether hotels have become more lenient about when a customer can use a credit.

Next: Direct, Corporate, and Online TA appear to have fewer cancelations, and the majority of cancelations occur around the same lead times as non-cancelations - but the further out in time you go, the more cancelations outnumber non-cancelations. This leads us to our next group.

Offline TA/TO and Groups both seem to have "seasonal" distributions for cancelations. Notice that as lead time goes up, so do cancelations, but in intervals. Take Offline TA/TO for example: its distribution of cancelations occurs in 4 humps - around 1 month out, 3 months out, 6-9 months out, and roughly 1 year out. A similar pattern occurs with Groups. I don't have an explanation for Offline TA/TO or why that behavior exists - maybe it's family vacation planning? But Groups makes sense: you can plan far in advance for groups, but we all have those flaky friends. Once again, speculation and experience, not a real interpretation of the data.

My real interpretation is that lead time generally dictates the likelihood of cancelation, based on the visualization above. But this interpretation may need to be tailored to certain groups, while other groups (like Aviation) get a different interpretation of the data.

Let's look at this same plot but with how many weekend nights are booked in a stay

In [26]:
# plot of weekend nights
violin_plot("market_segment", "stays_in_weekend_nights", "is_canceled", df_fil_features)

Once again, we deal with more outliers. 20 weekend nights is 10 weeks - that's a long stay. Let's cut the distribution at 3 weekend nights (or 1-2 weeks).

In [27]:
# filter out records where the weekend night stays are above 3
df_fil_features = df_fil_features[df_fil_features["stays_in_weekend_nights"] <= 3]
In [28]:
# replot
violin_plot("market_segment", "stays_in_weekend_nights", "is_canceled", df_fil_features)

This cut of the data doesn't really tell me much. I won't discard this feature as useless, but perhaps we can repurpose it later to get a better sense of the data (like converting it into a binary: did the stay include a weekend or not).
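A sketch of that repurposing, with a hypothetical stayed_weekend flag name:

```python
import pandas as pd

# toy stand-in for the stays_in_weekend_nights column
demo = pd.DataFrame({"stays_in_weekend_nights": [0, 2, 1, 0, 3]})

# collapse the night count into a binary flag: did the stay include a weekend night?
demo["stayed_weekend"] = (demo["stays_in_weekend_nights"] > 0).astype(int)
```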

Let's try this again with weekday stays

In [29]:
# plot of weekday nights
violin_plot("market_segment", "stays_in_week_nights", "is_canceled", df_fil_features)

Once again, this doesn't appear to be very useful - at least not by how I interpret the data.

Let's move on to more interesting metrics, like kids or previous cancelations.

In [30]:
# plot of kids
violin_plot("market_segment", "children_and_babies", "is_canceled", df_fil_features)

Let's cut the data to only include families with 4 or fewer children. Anything more appears to skew our visualizations.

In [31]:
# keep records with 4 or fewer children and babies
df_fil_features2 = df_fil_features[df_fil_features["children_and_babies"] <= 4]
In [32]:
# replot
violin_plot("market_segment", "children_and_babies", "is_canceled", df_fil_features2)

Wow, what a disappointing metric. I had a hypothesis that kids would be a leading cause of cancelation, but that's not apparent here. Let's hold on to this feature - it can be useful in combination with other features to parse out causes for cancelations.

In [33]:
# market segment by previous cancelations
violin_plot("market_segment", "previous_cancellations", "is_canceled", df_fil_features2)

There are some outliers here as well. I would suspect that 1 or 2 cancelations in a lifetime is more representative of the general population. Let's filter to those.

In [34]:
# filter to at most 2 previous cancelations
df_fil_features2 = df_fil_features2[df_fil_features2["previous_cancellations"] <= 2]
In [35]:
# replot
violin_plot("market_segment", "previous_cancellations", "is_canceled", df_fil_features2)

This metric is not useful, and most likely will not be useful in combination with other metrics, simply because the vast majority of records have "0" previous cancelations. I won't be using this metric any further in my analysis.

The last continuous metric to analyze is booking_changes.

In [36]:
# plot booking changes by market segment
violin_plot("market_segment", "booking_changes", "is_canceled", df_fil_features2)

I would say a reasonable number of booking changes is around 3 per booking - and even that's a lot. Let's filter out these outliers.

In [37]:
df_fil_features2 = df_fil_features2[df_fil_features2["booking_changes"] <= 3]
In [38]:
# replot
violin_plot("market_segment", "booking_changes", "is_canceled", df_fil_features2)

There is not a lot of analysis to gain from this. I might discard this feature because it doesn't appear useful - even in combination with other metrics - since the vast majority of records hover around 0 booking changes.

Let's take a look at regular histogram distributions of the data. I don't expect to answer any questions or gain any insights; I just want to see how the data falls.

In [39]:
# histograms of the data after filtering

# set color map
cmap = sns.diverging_palette(150, 50, 50, as_cmap=True)

# weekend stays
sns.displot(df_fil_features2["stays_in_weekend_nights"], color="c")
Out[39]:
<seaborn.axisgrid.FacetGrid at 0x20afc9831f0>
In [40]:
# week day stays
sns.displot(df_fil_features2["stays_in_week_nights"], color="c")
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x20afc8c2940>
In [41]:
# lead time
sns.displot(df_fil_features2["lead_time"], color="c", kde=True)
Out[41]:
<seaborn.axisgrid.FacetGrid at 0x20af73cd2b0>

Distributions look about as I expected. For weekend nights, it makes sense that the majority do not stay a weekend, and those that do stay 1-2 nights (a Saturday and/or Sunday night). For weekday nights, it makes sense that the vast majority stay within a 7-day window, with 1-3 nights being most popular. In terms of lead time, the vast majority of bookings are made within 100 days (~3 months) of arrival, and of those, a sizeable portion fall in the 0-5 day mark (each bin above is 5 days).
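To put numbers on those eyeball estimates, the shares can be computed directly with pandas. A minimal sketch on toy lead times (the real series would be df_fil_features2["lead_time"]):

```python
import pandas as pd

# toy lead times in days - the real series is df_fil_features2["lead_time"]
lead_time = pd.Series([0, 2, 4, 15, 40, 90, 120, 190])

# share of bookings made within 100 days of arrival
share_within_100 = (lead_time <= 100).mean()

# 5-day buckets, mirroring the histogram bins above
bins = pd.cut(lead_time, bins=range(0, 196, 5), right=False)
```

`right=False` makes each bucket half-open, e.g. [0, 5), matching "each bin is 5 days".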

But these distributions just show us general information. What do these distributions look like split between canceled vs not canceled?

In [42]:
# initialize our figure and axis
f, axis = plt.subplots(nrows=6, ncols=1, figsize=(20, 40), sharex=False, sharey=False)

# get all columns
columns = ["stays_in_weekend_nights", "stays_in_week_nights", "lead_time", "children_and_babies", "booking_changes", "previous_cancellations"]

# for each column and each subplot, plot the column on each subplot
for col, ax in zip(columns, axis):
    sns.kdeplot(data=df_fil_features2, x=col, hue='is_canceled', fill=True, ax=ax)

plt.show()

These distributions tell us more. True, no single feature gives a perfect split between canceled and not canceled, but within each feature we can see where more cancelations occur. This is starting to answer our why. Further, the lead time distribution confirms what we saw in the violin plots: the longer the lead time, the more cancelations.

There is one more analysis I'd like to perform: what do these distributions look like per market segment? To save space and time, I looked at all market segments and kept only the interesting ones. The analysis follows:

In [43]:
# initialize our figure and axis
f, axis = plt.subplots(nrows=6, ncols=1, figsize=(20, 40), sharex=False, sharey=False)

# get all columns
columns = ["stays_in_weekend_nights", "stays_in_week_nights", "lead_time", "children_and_babies", "booking_changes", "previous_cancellations"]

# for each column and each subplot, plot the column on each subplot
for col, ax in zip(columns, axis):
    sns.kdeplot(data=df_fil_features2[df_fil_features2["market_segment"] == "Offline TA/TO"], x=col, hue='is_canceled', fill=True, ax=ax)

plt.show()
In [44]:
# initialize our figure and axis
f, axis = plt.subplots(nrows=6, ncols=1, figsize=(20, 40), sharex=False, sharey=False)

# get all columns
columns = ["stays_in_weekend_nights", "stays_in_week_nights", "lead_time", "children_and_babies", "booking_changes", "previous_cancellations"]

# for each column and each subplot, plot the column on each subplot
for col, ax in zip(columns, axis):
    sns.kdeplot(data=df_fil_features2[df_fil_features2["market_segment"] == "Groups"], x=col, hue='is_canceled', fill=True, ax=ax)

plt.show()

Interesting - now we are getting somewhere. We now know that stays in week/weekend nights and lead time are useful metrics for determining cancelations in two main market segments: Groups and Offline TA/TO. Offline TA/TO has roughly a 60/40 cancelation split on weekday stays, while also confirming that longer lead times lead to more cancelations. Groups has a discernible pattern between canceled and not canceled across all features. If we can't make a healthy model that generalizes to the entire population, we can at least make a good model for the "Groups" market segment.

Let's move on to other features

Booleans¶

Another way to look at the data is to convert continuous metrics into booleans. Look at the above distributions: there are 3 features that are awkwardly skewed toward a single value - children_and_babies, booking_changes, and previous_cancellations. Let's turn these into binary categories: has_children, changed_booking, canceled_previously. This will give us a better view of the data.

Note: we filtered out "outliers" above. Because we are now creating more useful features, let's undo most of that filtering. We will keep the lead_time and week/weekend-night filters but drop the rest. I have retroactively taken care of this by using df_fil_features2 as the dataset carrying the filters on children_and_babies, booking_changes, and previous_cancellations. For this part we will simply use df_fil_features, which does not include the filters on those 3 features but still retains our filters on lead_time and week/weekend nights.

In [45]:
# create boolean features - a comparison yields booleans, which cast cleanly to 0/1 ints
# (this also avoids the float intermediate that the .loc assignment approach produces)
df_fil_features = df_fil_features.reset_index(drop=True)
df_fil_features["has_children"] = (df_fil_features["children_and_babies"] > 0).astype(int)
df_fil_features["changed_booking"] = (df_fil_features["booking_changes"] > 0).astype(int)
df_fil_features["canceled_previously"] = (df_fil_features["previous_cancellations"] > 0).astype(int)

Let's look at a general view to see if any further investigation is needed

In [46]:
# initialize our figure and axis
f, axis = plt.subplots(nrows=3, ncols=1, figsize=(20, 20), sharex=False, sharey=False)

# get all columns
columns = ["has_children", "changed_booking", "canceled_previously"]

# for each column and each subplot, plot the column on each subplot
for col, ax in zip(columns, axis):
    sns.kdeplot(data=df_fil_features, x=col, hue='is_canceled', fill=True, ax=ax)

plt.show()

This actually tells us something interesting: customers appear more likely to cancel a booking outright than to change it even once. At a bird's-eye view, this is an opportunity for hotels to shift their focus toward persuading customers to rearrange their bookings instead of canceling.
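One way to quantify that claim is a row-normalized crosstab of cancelation rate by the changed_booking flag. A sketch on toy data (column names match the notebook; the values are made up):

```python
import pandas as pd

# toy booking data - real columns come from df_fil_features
df_demo = pd.DataFrame({
    "changed_booking": [0, 0, 0, 0, 1, 1, 1, 1],
    "is_canceled":     [1, 1, 1, 0, 0, 0, 1, 0],
})

# normalize="index" turns counts into per-row rates: each row sums to 1,
# so column 1 holds the cancelation rate within each changed_booking group
rates = pd.crosstab(df_demo["changed_booking"], df_demo["is_canceled"], normalize="index")
```

On the real data, comparing `rates.loc[0, 1]` against `rates.loc[1, 1]` would test the hypothesis directly.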

Combinatorial¶

We've seen some useful features and some not-so-useful ones. However, let's see what happens if we look at features in combination. Remember, at this point we pretty much understand the who; we are still answering the why.

In [47]:
sns.pairplot(df_fil_features, hue="is_canceled", height=3)
Out[47]:
<seaborn.axisgrid.PairGrid at 0x20af70379a0>

The above is a lot - typically too much for any detailed analysis. But I like using pairplots with large numbers of features to get a high-level view of correlations. Ignoring the features that were derived from other features (has_children, changed_booking, canceled_previously), we can see some delineation. See the following:

In [48]:
sns.pairplot(df_fil_features[["lead_time", "canceled_previously", "is_canceled"]], hue="is_canceled", height=3)
Out[48]:
<seaborn.axisgrid.PairGrid at 0x20a9114aa30>
In [49]:
sns.pairplot(df_fil_features[["has_children", "previous_cancellations", "is_canceled"]], hue="is_canceled", height=3)
Out[49]:
<seaborn.axisgrid.PairGrid at 0x20a88e7fe20>
In [50]:
sns.pairplot(df_fil_features[["is_repeated_guest", "previous_cancellations", "is_canceled"]], hue="is_canceled", height=3)
Out[50]:
<seaborn.axisgrid.PairGrid at 0x20a8f839cd0>

What do the above tell us?

  1. Given that you have canceled previously, you are more likely to cancel than not.
  2. Customers with multiple previous cancelations tend to not have children on the booking and will more likely cancel.
  3. Repeat guests are less likely to cancel than first-time guests.
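Takeaway 3 can be checked numerically: since is_canceled is 0/1, a group mean is the cancelation rate. A sketch with made-up values (real columns come from df_fil_features):

```python
import pandas as pd

# toy data - the real columns are df_fil_features["is_repeated_guest"] / ["is_canceled"]
df_demo = pd.DataFrame({
    "is_repeated_guest": [0, 0, 0, 0, 1, 1],
    "is_canceled":       [1, 1, 0, 1, 0, 0],
})

# mean of a 0/1 target per group = cancelation rate per group
cancel_rate = df_demo.groupby("is_repeated_guest")["is_canceled"].mean()
```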

Correlations¶

In [51]:
sns.heatmap(df_fil_features[["lead_time", "stays_in_weekend_nights", "stays_in_week_nights", "children_and_babies", "is_repeated_guest", "previous_cancellations", "booking_changes", "is_canceled"]].corr())
Out[51]:
<AxesSubplot:>

Correlations are not interesting here. Stays in week nights and weekend nights are, unsurprisingly, correlated. is_canceled and lead_time are loosely correlated - but we've already established this in prior visualizations. This representation of the data does not tell us new information.
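For screening features by correlation with the target, sorting the absolute correlations is a quick complement to the heatmap. A sketch on synthetic data (not the booking set; lead_time is constructed to drive the target):

```python
import numpy as np
import pandas as pd

# synthetic stand-ins: lead_time drives cancelation, booking_changes is unrelated
lead_time = np.arange(200)
is_canceled = (lead_time > 120).astype(int)
booking_changes = np.tile([0, 1, 2, 3], 50)

df_demo = pd.DataFrame({
    "lead_time": lead_time,
    "booking_changes": booking_changes,
    "is_canceled": is_canceled,
})

# rank features by absolute correlation with the target
corr_with_target = (
    df_demo.corr()["is_canceled"].drop("is_canceled").abs().sort_values(ascending=False)
)
```

On the real frame, the same chain would rank all numeric features at once instead of eyeballing heatmap cells.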

UMAP¶

So what happens if we let algorithms and math condense the features that account for the important variation? I introduce UMAP, a dimensionality reduction (DR) method and tool. UMAP is similar to t-SNE (t-distributed Stochastic Neighbor Embedding) in that it works from neighbors. In contrast to other DR tools like PCA, UMAP does not explain variance by matrix factorization. Additionally, PCA works on the assumption that the first 2 (or 3) principal components account for the majority of the variation. In simple datasets this holds; in this dataset it may even hold. But in much larger and more complex datasets, PCA may be sub-optimal for its intended purpose.
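That PCA assumption is checkable: fit a PCA and inspect `explained_variance_ratio_` for the first two components. A sketch on synthetic data, not the booking features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# synthetic data: 7 numeric features, like the booking feature set used below
X = rng.normal(size=(500, 7))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=500)  # one strongly redundant feature

pca = PCA().fit(X)

# variance captured by the first two principal components;
# if this is high, a 2-D PCA projection would be defensible
var_two = pca.explained_variance_ratio_[:2].sum()
```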

UMAP takes high-dimensional data and reduces it to a 2-D or 3-D plot. It does this by computing similarity scores among neighbors (or clusters) and preserving those similarities while projecting onto a lower dimension, iterating until it reaches the desired dimensionality, like 2-D or 3-D.

Thanks StatQuest for the explanations: https://www.youtube.com/watch?v=eN0wFzBA4Sc

In the following code, I will perform a UMAP on the feature-set data from the previous sections. After this analysis I will use it on the entire dataset. This is my first time using UMAP but not my first time with dimensionality reduction (I have done PCA and t-SNE). I am following along with the documentation (https://umap-learn.readthedocs.io/en/latest/basic_usage.html) and an example (https://plotly.com/python/t-sne-and-umap-projections/), but will repurpose these templates for my own purposes.

In [ ]:
# first, only use value columns - typically continuous variables, but UMAP will accept binary too
values = ['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights',
       'children_and_babies', 'is_repeated_guest',
       'previous_cancellations', 'booking_changes']

# initialize a reducer - we don't need anything special for our purposes
reducer = UMAP(random_state=42)

# fit onto the values features from our dataset
reducer.fit(df_fil_features[values])

# reduce our dataset to "projections" - these are 2D values derived from our higher dimensional dataset
projections = reducer.transform(df_fil_features[values])
In [99]:
# using plotly, create a scatter plot of our new data (projections) dissecting the data into canceled and not canceled
fig = px.scatter(
    projections, x=0, y=1,
    color=df_fil_features["is_canceled"], 
    labels={'color': 'is_canceled'},
    color_continuous_scale =["blue", "red"],
)

# plot
fig.show()

We can see a discernible pattern between canceled and not canceled. Not-canceled points cluster in the middle of the scatter, while canceled points appear as a faint ring surrounding it (though some also sit in the middle, behind the blue dots). This is interesting and can somewhat distinguish canceled from not canceled, but I'm not totally on board. We may need to include more features to make this useful.

In [106]:
all_values = ['lead_time', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'booking_changes', 'days_in_waiting_list', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests']

# initialize a reducer - we don't need anything special for our purposes
reducer = UMAP(random_state=42)

# fit onto the values features from our dataset
reducer.fit(df[all_values].replace(np.nan, 0))

# reduce our dataset to "projections" - these are 2D values derived from our higher dimensional dataset
projections = reducer.transform(df[all_values].replace(np.nan, 0))
C:\Users\wisni\anaconda3\envs\7324env\lib\site-packages\sklearn\manifold\_spectral_embedding.py:259: UserWarning:

Graph is not fully connected, spectral embedding may not work as expected.

In [107]:
# using plotly, create a scatter plot of our new data (projections) dissecting the data into canceled and not canceled
fig = px.scatter(
    projections, x=0, y=1,
    color=df["is_canceled"], 
    labels={'color': 'is_canceled'},
    color_continuous_scale =["blue", "red"],
)

# plot
fig.show()

This pattern is much more pronounced when including all features. This is useful and tells us that there is a reasonable way to distinguish between a cancelation and a non-cancelation with UMAP dimensionality reduction. Let's take this a step further. The following plots color the projection by categorical features. I looked at all categorical features, but to keep this short, I only included the ones with discernible patterns.

In [116]:
# using plotly, create a scatter plot of the projections, colored by meal type
fig = px.scatter(
    projections, x=0, y=1,
    color=df["meal"], 
    labels={'color': 'meal'}
)

# plot
fig.show()

This may not be particularly useful, but there is a distinct pattern between SC and BB meal types: SC is more centered in the scatter, while BB forms more of a ring. HB is a bit of both, and Undefined also sits toward the center. This could actually be useful if we wanted to assume and impute values for Undefined - we could assign those points to SC (purple), as most of them overlap where the SC dots are populated.
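One hedged way to operationalize that imputation idea: fit a nearest-neighbor classifier on the 2-D projections of the labeled meal types and let it assign the Undefined points. A sketch on synthetic center/ring data mimicking the SC/BB layout above (all values are illustrative, not the real projections):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# toy 2-D "projections": SC points clustered at the center, BB points on a ring
sc_points = rng.normal(0.0, 0.5, size=(50, 2))
angles = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
bb_points = 5.0 * np.c_[np.cos(angles), np.sin(angles)]

X = np.vstack([sc_points, bb_points])
y = np.array(["SC"] * 50 + ["BB"] * 50)

# assign Undefined points the meal type of their nearest labeled neighbors
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
undefined_points = np.array([[0.1, -0.2], [4.8, 0.5]])  # one central, one near the ring
imputed = knn.predict(undefined_points)
```

On the real data, `X` would be the projections of rows with a defined meal type and `undefined_points` the projections of the Undefined rows.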

In [117]:
# using plotly, create a scatter plot of the projections, colored by market segment
fig = px.scatter(
    projections, x=0, y=1,
    color=df["market_segment"], 
    labels={'color': 'market_segment'}
)

# plot
fig.show()

There isn't much of interest here other than Offline TA/TO being in the center, Groups following a similar pattern (though they also form a ring), and all other market segments not having a recognizable pattern.

In [118]:
# using plotly, create a scatter plot of the projections, colored by distribution channel
fig = px.scatter(
    projections, x=0, y=1,
    color=df["distribution_channel"], 
    labels={'color': 'distribution_channel'}
)

# plot
fig.show()

This is more defined. TA/TO accounts for the majority of both the ring and center points, while GDS makes up some of the center as well.

In [121]:
# using plotly, create a scatter plot of the projections, colored by deposit type
fig = px.scatter(
    projections, x=0, y=1,
    color=df["deposit_type"], 
    labels={'color': 'deposit_type'}
)

# plot
fig.show()

A clearer pattern with this view: No Deposit bookings are centered, and Non Refund types surround the ring.

In [124]:
# using plotly, create a scatter plot of the projections, colored by reservation status
fig = px.scatter(
    projections, x=0, y=1,
    color=df["reservation_status"], 
    labels={'color': 'reservation_status'}
)

# plot
fig.show()

Finally, this shows that No-Show statuses are concentrated in the center. It also shows that Canceled statuses sit on the perimeter, though they exist in the middle as well.

The previous visualizations show how different groups are explained with reduced dimensions. In many cases, these categorical features show discernible patterns. So how can this be useful? If one can explain what axis 0 (x) and axis 1 (y) represent to a C-suite or P/VP/Director-level hotel employee, then this would be useful in identifying behaviors among different categorical groups. That is, the reduced dimensions can explain the behaviors of different categories. Not bad for preliminary analysis.

Final Thoughts¶

We answered the Who, but the Why still needs more addressing. We can speculate, using our industry knowledge, on why cancelations occur, but without another survey addressing this specifically, speculation is all we have. This exercise is still useful. We have discernible patterns that we can build a model upon. There is some noise, but not enough to derail a model. I think this gives high-level employees at a hotel company a roadmap for where to focus on building a model and creating value. It also gives food for thought on what types of programs hotels could set up for these groups to drive their cancelation rates below what's shown.